84 research outputs found

    Large scale evaluation of importance maps in automatic speech recognition

    In this paper, we propose a metric that we call the structured saliency benchmark (SSBM) to evaluate importance maps computed for automatic speech recognizers on individual utterances. These maps indicate the time-frequency points of an utterance that are most important for correct recognition of a target word. Our evaluation technique is suitable not only for standard classification tasks but also for structured prediction tasks such as sequence-to-sequence models. Additionally, we use this approach to perform a large-scale comparison between the importance maps created by our previously introduced technique, which uses "bubble noise" to identify important points through correlation, and a baseline approach based on smoothed speech energy and forced alignment. Our results show that the bubble analysis approach is better at identifying important speech regions than this baseline on 100 sentences from the AMI corpus. Comment: submitted to INTERSPEECH 202

    Identifying, Evaluating and Applying Importance Maps for Speech

    Like many machine learning systems, speech models often perform well when employed on data in the same domain as their training data. However, when the inference is on out-of-domain data, performance suffers. With a fast-growing number of applications of speech models in healthcare, education, automotive, automation, etc., it is essential to ensure that speech models can generalize to out-of-domain data, especially to noisy environments in real-world scenarios. In contrast, human listeners are quite robust to noisy environments. Thus, a thorough understanding of the differences between human listeners and speech models is urgently required to enhance speech model performance in noise. These differences exist presumably because the speech model does not use the same information as humans for recognizing the speech. A possible solution is encouraging the speech model to attend to the same time-frequency regions as human listeners. In this way, speech model generalization in noise may be improved. We define those time-frequency regions that humans or machines focus on to recognize the speech as importance maps (IMs). In this research, first, we investigate how to identify speech importance maps. Second, we compare human and machine importance maps to understand how they differ and how the speech model can learn from humans to improve its performance in noise. Third, we develop a structured saliency benchmark (SSBM), a metric for evaluating IMs. Finally, we propose a new application of IMs as data augmentation for speech models, enhancing their performance and enabling them to better generalize to out-of-domain noise. Overall, our work demonstrates that we can improve speech models and achieve out-of-domain generalization to different noise environments with importance maps. 
    In the future, we will expand our work to large-scale speech models and apply different methods for identifying IMs, such as those based on human responses, using the resulting IMs to augment the speech data. We can also extend the technique to computer vision tasks such as image recognition, by predicting importance maps for images and using them to improve model performance on out-of-domain data.

    Analysis of the railway heave induced by soil swelling at a site in southern France

    In order to better understand the heave observed on the railway roadbed of the French high-speed train (TGV) at Chabrillan in southern France, the swelling behaviour of the expansive clayey marl involved, taken from the site by coring, was investigated. The aim of the study is to analyse the part of the heave induced by soil swelling. First, the swell potential was determined by flooding the soil specimen in an oedometer under its in-situ overburden stress. Second, in order to assess the swell induced by the excavation undertaken during the construction of the railway, another method was applied: the soil was first loaded to the in-situ overburden stress that existed before the excavation, then flooded and unloaded to its current overburden stress (after the excavation), and the swell induced by this unloading was measured. Finally, the experimental results were analysed together with the results from other laboratory tests performed previously and the data collected from field monitoring. This study allowed the heave induced by soil swelling to be estimated. The part of the heave due to the landslide could then be estimated as the difference between the monitored heave and the swelling heave.

    Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation

    We present an approach to reduce the performance disparity between geographic regions without degrading performance on the overall user population for ASR. A popular approach is to fine-tune the model with data from regions where the ASR model has a higher word error rate (WER). However, when the ASR model is adapted to get better performance on these high-WER regions, its parameters wander from the previous optimal values, which can lead to worse performance in other regions. In our proposed method, we utilize the elastic weight consolidation (EWC) regularization loss to identify directions in parameter space along which the ASR weights can vary to improve performance in high-error regions, while still maintaining performance on the speaker population overall. Our results demonstrate that EWC can reduce the WER in the region with the highest WER by 3.2% relative while reducing the overall WER by 1.3% relative. We also evaluate the role of language and acoustic models in ASR fairness and propose a clustering algorithm to identify WER disparities based on geographic region. Comment: Accepted for publication at Interspeech 202
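The EWC regularizer mentioned in this abstract is commonly written as a quadratic penalty that weights each parameter's drift from its previous value by a diagonal Fisher information estimate. The sketch below shows that general form in plain Python. The function names `ewc_penalty` and `ewc_loss`, the strength `lam`, and the toy numbers are illustrative assumptions, not details taken from the paper.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lam):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - anchor_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor_params, fisher_diag)
    )


def ewc_loss(task_loss, params, anchor_params, fisher_diag, lam=1.0):
    """Fine-tuning loss plus the EWC penalty (lam is a hypothetical strength)."""
    return task_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)


# Toy example with two parameters (made-up values).
anchor = [1.0, -2.0]   # parameters before fine-tuning
fisher = [10.0, 0.1]   # first parameter matters much more to the old task

# Moving the "important" parameter by 0.5 costs far more than moving
# the "unimportant" one by the same amount.
move_important = ewc_penalty([1.5, -2.0], anchor, fisher, lam=1.0)
move_unimportant = ewc_penalty([1.0, -1.5], anchor, fisher, lam=1.0)
```

Under these assumptions, the penalty leaves low-Fisher directions relatively free to change, which is the sense in which fine-tuning can improve high-WER regions while staying close to the original solution elsewhere.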

    Interregional Input-Output Analysis between the Mekong Delta Region (MDR) and the Rest of Vietnam (ROV)

    The Mekong Delta is an important economic area located in the southern part of Vietnam. It has considerable potential and many opportunities for development, but it also faces new challenges: global climate change, sea level rise, the consequences of blocking the river, and the need for the Mekong countries to become more competitive amid international integration. Alongside these challenges, the region has new opportunities arising from economic restructuring in line with the national policy of restructuring the economy under new conditions, including the establishment of special economic zones such as the Phu Quoc resort. In addition to analysis based on modern economic theory, this paper uses an inter-sectoral input-output (I/O) framework, updated in 2016, for two regions, the Mekong Delta Region (MDR) and the Rest of Vietnam (ROV), to identify inter-regional impacts and to calculate some impact assessments of climate change. The study also analyzes other factors related to sustainable regional development under new conditions, income distribution, and social security.
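For readers unfamiliar with input-output analysis, the standard Leontief relation x = (I - A)^(-1) f gives the gross output x each region needs in order to meet final demand f. The sketch below is a minimal two-region illustration, assuming one aggregate sector per region. The coefficient matrix `A` and the demand vector `f` are made-up numbers, not values from the paper's 2016 table.

```python
def leontief_inverse_2x2(A):
    """Return (I - A)^(-1) for a 2x2 technical-coefficient matrix A."""
    a, b = 1.0 - A[0][0], -A[0][1]
    c, d = -A[1][0], 1.0 - A[1][1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("I - A is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def gross_output(A, f):
    """Solve x = A x + f, i.e. x = (I - A)^(-1) f."""
    L = leontief_inverse_2x2(A)
    return [L[0][0] * f[0] + L[0][1] * f[1],
            L[1][0] * f[0] + L[1][1] * f[1]]


# Hypothetical coefficients: row/column 0 = MDR, row/column 1 = ROV.
# A[i][j] is the input from region i needed per unit of output in region j.
A = [[0.2, 0.1],
     [0.3, 0.4]]
f = [100.0, 200.0]   # hypothetical final demand in each region

L = leontief_inverse_2x2(A)
x = gross_output(A, f)
```

The off-diagonal entries of `L` capture the inter-regional effect: `L[0][1]` is the extra MDR output induced by one additional unit of final demand in ROV.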

    Adaptive Endpointing with Deep Contextual Multi-armed Bandits

    Current endpointing (EP) solutions learn in a supervised framework, which does not allow the model to incorporate feedback and improve in an online setting. It is also common practice to use a costly grid search to find the best configuration for an endpointing model. In this paper, we aim to provide a solution for adaptive endpointing by proposing an efficient method for choosing an optimal endpointing configuration given utterance-level audio features in an online setting, while avoiding hyperparameter grid search. Our method does not require ground-truth or annotated labels and learns online from reward signals alone. Specifically, we propose a deep contextual multi-armed bandit-based approach, which combines the representational power of neural networks with the action exploration behavior of Thompson modeling algorithms. We compare our approach to several baselines and show that our deep bandit models also succeed in reducing early cutoff errors while maintaining low latency.
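The exploration idea behind this abstract can be illustrated with a much simpler bandit. The sketch below is plain Beta-Bernoulli Thompson sampling over a few hypothetical endpointing configurations. It ignores the audio context and the neural network entirely, so it is not the paper's deep contextual model. The reward probabilities, the function name `thompson_sampling`, and the binary reward are assumptions made for illustration.

```python
import random


def thompson_sampling(reward_probs, rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling; return pull counts per arm.

    reward_probs holds the (normally unknown) success probability of each
    arm, used here only to simulate reward signals.
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    successes = [1] * n_arms   # Beta(1, 1) prior: alpha counts
    failures = [1] * n_arms    # Beta(1, 1) prior: beta counts
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Sample a plausible success rate for each arm from its posterior.
        samples = [rng.betavariate(successes[a], failures[a])
                   for a in range(n_arms)]
        # Play the arm whose sampled rate is highest.
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < reward_probs[arm] else 0
        # Update the chosen arm's posterior using the reward signal only.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


# Three hypothetical endpointing configurations with made-up reward rates.
pulls = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
```

No ground-truth labels appear anywhere in the loop: the learner only observes a reward for the configuration it chose, which mirrors the online setting the abstract describes.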